Abstract


Data-driven representation learning for words is a technique of central
importance in NLP. While indisputably useful as a source of features in
downstream tasks, such vectors tend to consist of uninterpretable
components whose relationship to the categories of traditional lexical
semantic theories is tenuous at best. We present a method for constructing
interpretable word vectors from hand-crafted linguistic resources such as
WordNet and FrameNet. These vectors are binary (i.e., they contain only
0 and 1) and are 99.9% sparse. We analyze their performance on
state-of-the-art evaluation methods for distributional models of word
vectors and find that they are competitive with standard distributional
approaches.


1 Introduction 


Distributed representations of words have been shown to benefit a diverse
set of NLP tasks, including syntactic parsing (Lazaridou et al., 2013;
Bansal et al., 2014), named entity recognition (Guo et al., 2014) and
sentiment analysis (Socher et al., 2013). Additionally, because they can be
induced directly from unannotated corpora, they are likewise available in
domains and languages where traditional linguistic resources do not exist.
Intrinsic evaluations on various tasks are helping refine vector learning
methods to discover representations that capture many facts about lexical
semantics (Turney, 2001; Turney and Pantel, 2010).


Yet induced word vectors do not look anything like the representations
described in most lexical semantic theories, which focus on identifying
classes of words (Levin, 1993; Baker et al., 1998; Schuler, 2005;
Miller, 1995). Though expensive to construct, symbolic conceptualizations
of word meaning are important for theoretical understanding, and
interpretability is desirable in computational models.

Our contribution to this discussion is a new technique that constructs
task-independent word vector representations using linguistic knowledge
derived from pre-constructed linguistic resources such as WordNet
(Miller, 1995), FrameNet (Baker et al., 1998) and the Penn Treebank
(Marcus et al., 1993). In such word vectors every dimension is a linguistic
feature, and 1/0 indicates the presence or absence of that feature in a
word; the vector representations are therefore binary and highly sparse
(≈ 99.9%). Since these vectors do not encode any word cooccurrence
information, they are non-distributional. An additional benefit of
constructing such vectors is that they are fully interpretable, i.e., every
dimension of these vectors maps to a linguistic feature, unlike
distributional word vectors, whose dimensions have no direct
interpretation.

Of course, engineering feature vectors from linguistic resources is
established practice in many applications of discriminative learning, e.g.,
parsing (McDonald and Pereira, 2006; Nivre, 2008) or part-of-speech tagging
(Ratnaparkhi, 1996; Collins, 2002). However, despite a common inventory of
features that reappears across many tasks, feature engineering tends to be
seen as a task-specific problem, and engineered feature vectors are not
typically evaluated independently of the tasks they are designed for. We
evaluate the quality of our linguistic vectors on a number of tasks that
have been proposed for evaluating distributional word vectors. We show that
linguistic word vectors are comparable to current state-of-the-art
distributional word vectors trained on billions of words, as evaluated on a
battery of semantic and syntactic evaluation benchmarks.¹

¹ Our vectors can be downloaded at:
https://github.com/mfaruqui/non-distributional










































Lexicon          Vocabulary   Features
WordNet              10,794     92,117
Supersense           71,836         54
FrameNet              9,462      4,221
Emotion               6,468         10
Connotation          76,134         12
Color                14,182         12
Part of Speech       35,606         20
Syn. & Ant.          35,693     75,972
Union               119,257    172,418

Table 1: Sizes of vocabulary and features induced from different
linguistic resources.


2 Linguistic Word Vectors

We construct linguistic word vectors by extracting word-level information
from linguistic resources. Table 1 shows the size of the vocabulary and the
number of features induced from every lexicon. We now describe the
linguistic resources that we use for constructing linguistic word vectors.

WordNet. WordNet (Miller, 1995) is an English lexical database that groups
words into sets of synonyms called synsets and records a number of
relations among these synsets or their members. For a word, we look up its
synsets for all possible part of speech (POS) tags that it can assume. For
example, film will have Synset.Film.V.01 and Synset.Film.N.01 as features,
as it can be both a verb and a noun. In addition to synsets, we include the
hyponym (e.g., Hypo.CollageFilm.N.01), hypernym (e.g., Hyper.Sheet.N.06)
and holonym synsets of the word as features. We also collect antonyms and
pertainyms of all the words in a synset and include those as features in
the linguistic vector.

Supersenses. WordNet partitions nouns and verbs into semantic field
categories known as supersenses (Ciaramita and Altun, 2006; Nastase, 2008).
For example, lioness evokes the supersense SS.Noun.Animal. These
supersenses were further extended to adjectives (Tsvetkov et al., 2014).²
We use these supersense tags for nouns, verbs and adjectives as features in
the linguistic word vectors.

FrameNet. FrameNet (Baker et al., 1998; Fillmore et al., 2003) is a rich
linguistic resource that contains information about lexical and
predicate-argument semantics in English. Frames can be realized on the
surface by many different word types, which suggests that the word types
evoking the same frame should be semantically related. For every word, we
use the frame it evokes along with the roles of the evoked frame as its
features. Since information in FrameNet is part of speech (POS)
disambiguated, we couple these features with the corresponding POS tag of
the word. For example, since appreciate is a verb, it will have the
following features: Verb.Frame.Regard, Verb.Frame.Role.Evaluee, etc.


Emotion & Sentiment. Mohammad and Turney (2013) constructed two different
lexicons that associate words with sentiment polarity and with emotions,
respectively, using crowdsourcing. The polarity is either positive or
negative, while there are eight different kinds of emotions, such as anger,
anticipation and joy. Every word in the lexicon is associated with these
properties. For example, cannibal evokes Pol.Neg, Emo.Disgust and Emo.Fear.
We use these properties as features in linguistic vectors.

Connotation. Feng et al. (2013) construct a lexicon that contains
information about the connotation of words that are seemingly objective but
often allude to nuanced sentiment. They assign positive, negative and
neutral connotations to these words. This lexicon differs from that of
Mohammad and Turney (2013) in that it has a more subtle shade of sentiment
and it extends to many more words. For example, delay has a negative
connotation Con.Noun.Neg, floral has a positive connotation Con.Adj.Pos and
outline has a neutral connotation Con.Verb.Neut.

Color. Most languages have expressions involving color; for example, green
with envy and grey with uncertainty are phrases used in English. The
word-color association lexicon produced by Mohammad (2011) using
crowdsourcing lists the colors that a word evokes in English. We use every
color in this lexicon as a feature in the vector. For example, Color.Red is
a feature evoked by the word blood.

Part of Speech Tags. The Penn Treebank (Marcus et al., 1993) annotates
naturally occurring text for linguistic structure. It contains syntactic
parse trees and POS tags for every word in the corpus. We collect all the
possible POS tags that a word is annotated with and use them as features in
the linguistic vectors. For example, love has PTB.Noun and PTB.Verb as
features.

² http://www.cs.cmu.edu/~ytsvetko/adj-supe…

Word        Pol.Pos  Color.Pink  SS.Noun.Feeling  PTB.Verb  Anto.Fair  Con.Noun.Pos
love              1           1                1         1          0             1
hate              0           0                1         1          0             0
ugly              0           0                0         0          1             0
beauty            1           1                0         0          0             1
refundable        0           0                0         0          0             1

Table 2: Some linguistic word vectors. 1 indicates presence and 0 indicates
absence of a linguistic feature.

Synonymy & Antonymy. We use Roget's thesaurus (Roget, 1852) to collect sets
of synonymous words.³ For every word, its synonymous word is used as a
feature in the linguistic vector. For example, adoration and affair have a
feature Syno.Love, and admissible has a feature Syno.Acceptable. The
synonym lexicon contains 25,338 words after removal of multiword phrases.
In a similar manner, we also use antonymy relations between words as
features in the word vector. The antonymous words for a given word were
collected from Ordway (1913).⁴ For example, impartiality has the features
Anto.Favoritism and Anto.Injustice. The antonym lexicon has 10,355 words.
These features are different from those induced from WordNet, as the former
encode word-word relations whereas the latter encode word-synset relations.

After collecting features from the various linguistic resources described
above, we obtain linguistic word vectors of 172,418 dimensions. These
vectors are 99.9% sparse, i.e., each vector on average contains only 34
non-zero features out of 172,418 total features. On average, a linguistic
feature (vector dimension) is active for 15 word types. The linguistic word
vectors contain 119,257 unique word types. Table 2 shows linguistic vectors
for some of the words.
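
The construction described in this section can be sketched in a few lines of
Python. This is a minimal illustration, not the released implementation: the
toy lexicon below is hypothetical, with feature assignments copied from
Table 2 and from the PTB.Noun example in the text, and the helper names
(build_feature_index, to_binary_vector, sparsity) are made up for this
sketch.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: word -> set of active linguistic features.
# Assignments mirror Table 2 plus the PTB.Noun example; illustrative only.
LEXICON: Dict[str, Set[str]] = {
    "love": {"Pol.Pos", "Color.Pink", "SS.Noun.Feeling",
             "PTB.Verb", "PTB.Noun", "Con.Noun.Pos"},
    "hate": {"SS.Noun.Feeling", "PTB.Verb"},
    "ugly": {"Anto.Fair"},
    "beauty": {"Pol.Pos", "Color.Pink", "Con.Noun.Pos"},
    "refundable": {"Con.Noun.Pos"},
}


def build_feature_index(lexicon: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign one vector dimension to every distinct feature (sorted order)."""
    features = sorted(set().union(*lexicon.values()))
    return {name: i for i, name in enumerate(features)}


def to_binary_vector(word: str, lexicon: Dict[str, Set[str]],
                     index: Dict[str, int]) -> List[int]:
    """Dense 0/1 vector for `word`; unknown words get the all-zero vector."""
    vec = [0] * len(index)
    for feat in lexicon.get(word, ()):
        vec[index[feat]] = 1
    return vec


def sparsity(lexicon: Dict[str, Set[str]], index: Dict[str, int]) -> float:
    """Fraction of zero entries in the whole word-by-feature matrix."""
    total = len(lexicon) * len(index)
    nonzero = sum(len(feats) for feats in lexicon.values())
    return 1.0 - nonzero / total


FEATURE_INDEX = build_feature_index(LEXICON)
```

On the full resource the same ratio is what yields the reported figure: 34
active features out of 172,418 dimensions leaves well over 99.9% of each
vector at zero, which is why storing only the set of active features per
word is the natural representation.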

3 Experiments 

We first briefly describe the evaluation tasks and 
then present results. 

3.1 Evaluation Tasks 

Word Similarity. We evaluate our word representations on three different
benchmarks to measure word similarity. The first one is the widely used
WS-353 dataset (Finkelstein et al., 2001), which contains 353 pairs of
English words that have been assigned similarity ratings by humans. The
second is the RG-65 dataset (Rubenstein and Goodenough, 1965) of 65 word
pairs. The third dataset is SimLex (Hill et al., 2014), which has been
constructed to overcome the shortcomings of WS-353 and contains 999 pairs
of adjectives, nouns and verbs. The similarity of a word pair is computed
as the cosine similarity between the two word vectors, and we report
Spearman's rank correlation between the ranking produced by the vector
model and the human ranking.
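
The evaluation protocol above can be illustrated with a short, standard
library only sketch. It assumes binary vectors are stored as sets of active
feature names, in which case the cosine reduces to the overlap divided by
the geometric mean of the set sizes. The function names are hypothetical,
and tie handling by average ranks is an assumption of this sketch; the text
does not say how ties were treated.

```python
import math
from typing import List, Sequence, Set


def cosine_binary(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two binary vectors given as sets of active features."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

In use, one would compute cosine_binary for every word pair in a benchmark
and pass the resulting list, together with the human ratings for the same
pairs, to spearman.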


Sentiment Analysis. Socher et al. (2013) created a treebank containing
sentences annotated with fine-grained sentiment labels on phrases and
sentences from movie review excerpts. The coarse-grained treebank of
positive and negative classes has been split into training, development,
and test datasets containing 6,920, 872, and 1,821 sentences, respectively.
We use the average of the word vectors of a given sentence as features in
an ℓ2-regularized logistic regression for classification. The classifier is
tuned on the dev set and accuracy is reported on the test set.
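
The sentence features described above amount to a component-wise mean. A
minimal sketch follows; the function name is hypothetical, and skipping
tokens that have no vector is an assumption of this sketch, since the text
does not describe out-of-vocabulary handling. The classifier itself is not
shown.

```python
from typing import Dict, List, Sequence


def average_sentence_vector(tokens: Sequence[str],
                            vectors: Dict[str, List[float]],
                            dim: int) -> List[float]:
    """Mean of the word vectors of `tokens`; tokens without a vector are skipped."""
    total = [0.0] * dim
    known = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # assumption: out-of-vocabulary tokens contribute nothing
        known += 1
        for i, value in enumerate(vec):
            total[i] += value
    if known == 0:
        return total  # all-zero vector when no token is covered
    return [t / known for t in total]
```

With binary linguistic vectors, each averaged component is simply the
fraction of covered words in the sentence that carry the corresponding
feature.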


³ http://www.gutenberg.org/ebooks/10681
⁴ https://archive.org/details/synonymsant


NP-Bracketing. Lazaridou et al. (2013) constructed a dataset from the Penn
Treebank (Marcus et al., 1993) of noun phrases (NP) of length three words,
where the first can be an adjective or a noun and the other two are nouns.
The task is to predict the correct bracketing in the parse tree for a given
noun phrase. For example, local (phone company) and (blood pressure)
medicine exhibit right and left bracketing, respectively. We append the
word vectors of the three words in the NP in order and use them as features
in an ℓ2-regularized logistic regression classifier. The dataset contains
2,227 noun phrases split into 10 folds. The classifier is tuned on the
first fold and cross-validation accuracy is reported on the remaining nine
folds.
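
Appending three sparse vectors in order can be done without materializing
them: the feature with index i of the word in position p lands at index
p * D + i of the concatenation. The sketch below illustrates this indexing
under the assumption that each word is stored as a set of active feature
indices; the function name and the toy indices in the usage note are
hypothetical.

```python
from typing import Dict, List, Sequence, Set


def concat_np_features(np_words: Sequence[str],
                       active: Dict[str, Set[int]],
                       dim: int) -> List[int]:
    """Indices that are 1 in the in-order concatenation of three word vectors."""
    if len(np_words) != 3:
        raise ValueError("expected a noun phrase of exactly three words")
    indices: List[int] = []
    for position, word in enumerate(np_words):
        offset = position * dim  # each word occupies its own block of `dim` slots
        indices.extend(offset + i for i in sorted(active.get(word, ())))
    return indices
```

For the sparse linguistic vectors (D = 172,418) the concatenated feature
space has 3 × 172,418 = 517,254 dimensions, of which only the few active
indices per word need to be stored.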































Vector                   Length (D)  Params.  Corpus Size  WS-353  RG-65  SimLex  Senti    NP
Skip-Gram                       300  D × N    300 billion    65.6   72.8    43.6   81.5  80.1
Glove                           300  D × N      6 billion    60.5   76.6    36.9   77.7  77.9
LSA                             300  D × N      1 billion    67.3   77.0    49.6   81.1  79.7
Ling Sparse                 172,418  -                  -    44.6   77.8    56.6   79.4  83.3
Ling Dense                      300  D × N              -    45.4   67.0    57.8   75.4  76.2
Skip-Gram ⊕ Ling Sparse     172,718  -                  -    67.1   80.5    55.5   82.4  82.8

Table 3: Performance of different types of word vectors on evaluation
tasks, reported as Spearman's correlation (WS-353, RG-65, SimLex) and
accuracy (Senti, NP). Bold shows the best performance for a task.


3.2 Linguistic Vs. Distributional Vectors 

In order to make our linguistic vectors comparable to publicly available
distributional word vectors, we perform singular value decomposition (SVD)
on the linguistic matrix to obtain word vectors of lower dimensionality. If
$L \in \{0,1\}^{N \times D}$ is the linguistic matrix with $N$ word types
and $D$ linguistic features, then we can obtain
$U \in \mathbb{R}^{N \times K}$ from the SVD of $L$ as follows:
$L = U \Sigma V^{\top}$, with $K$ being the desired length of the lower
dimensional space.
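
To make the shapes explicit, the decomposition can be written as a rank-K
truncation. Reading K as the number of retained singular values is an
assumption of this note; the text only states that U has K columns.

```latex
L \approx U_K \Sigma_K V_K^{\top}, \qquad
U_K \in \mathbb{R}^{N \times K}, \quad
\Sigma_K \in \mathbb{R}^{K \times K}, \quad
V_K \in \mathbb{R}^{D \times K}
```

With the sizes reported in this paper, N = 119,257 word types,
D = 172,418 features and K = 300 (the "Ling Dense" row of Table 3), each row
of U_K is the 300-dimensional dense vector of one word type.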

We compare both sparse and dense linguistic vectors to three widely used
distributional word vector models. The first two are the pre-trained
Skip-Gram (Mikolov et al., 2013)⁵ and Glove (Pennington et al., 2014)⁶ word
vectors, each of length 300, trained on 300 billion and 6 billion words
respectively. The third is latent semantic analysis (LSA): we obtain word
vectors from the SVD of a word-word cooccurrence matrix (Turney and
Pantel, 2010). These were trained on 1 billion words of Wikipedia with
vector length 300 and a context window of 5 words.

3.3 Results 

Table 3 shows the performance of different word vector types on the
evaluation tasks. Although Skip-Gram, Glove and LSA perform better than
linguistic vectors on WS-353, the linguistic vectors outperform them by a
large margin on SimLex (56.6 for the sparse vectors versus at most 49.6 for
the distributional models). The sparse linguistic vectors also perform
better on RG-65. On sentiment analysis, linguistic vectors are competitive
with Skip-Gram vectors, and on the NP-bracketing task the sparse linguistic
vectors outperform all distributional vectors by a statistically
significant margin (p < 0.05, McNemar's test; Dietterich, 1998). We append
the sparse linguistic vectors to Skip-Gram vectors and evaluate the
resultant vectors, as shown in the bottom row of Table 3. The combined
vector outperforms Skip-Gram on all tasks, showing that linguistic vectors
contain useful information orthogonal to distributional information.
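
For readers unfamiliar with McNemar's test, the sketch below shows the
commonly presented continuity-corrected form, which compares two classifiers
on the same test items using only the items on which they disagree. Which
variant was used for the results above is not stated, so the correction is
an assumption; the function name and the counts in the usage note are
hypothetical and are not taken from the experiments.

```python
import math
from typing import Tuple


def mcnemar_test(only_a_correct: int, only_b_correct: int) -> Tuple[float, float]:
    """Continuity-corrected McNemar statistic and its chi-square (1 d.o.f.) p-value.

    only_a_correct: items classifier A gets right and classifier B gets wrong.
    only_b_correct: items classifier B gets right and classifier A gets wrong.
    """
    discordant = only_a_correct + only_b_correct
    if discordant == 0:
        return 0.0, 1.0  # the classifiers never disagree
    stat = (abs(only_a_correct - only_b_correct) - 1) ** 2 / discordant
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For instance, with hypothetical disagreement counts of 25 and 10, the
statistic is (15 - 1)^2 / 35 = 5.6, which corresponds to a p-value below
0.05.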

It is evident from the results that linguistic vectors are either
competitive with or better than state-of-the-art distributional vector
models. Sparse linguistic word vectors are high-dimensional, but their
sparsity makes them computationally easy to work with.

4 Discussion 

Linguistic resources like WordNet have found extensive applications in
lexical semantics, for example, in word sense disambiguation and word
similarity (Resnik, 1995; Agirre et al., 2009). Recently there has been
interest in using linguistic resources to enrich word vector
representations. In these approaches, relational information among words
obtained from WordNet, Freebase, etc. is used as a constraint to encourage
words with similar properties in lexical ontologies to have similar word
vectors (Xu et al., 2014; Yu and Dredze, 2014; Bian et al., 2014; Fried and
Duh, 2014; Faruqui et al., 2015a). Distributional representations have also
been shown to improve by using experiential data in addition to
distributional context (Andrews et al., 2009). We have shown that simple
vector concatenation can likewise be used to improve representations
(further confirming the established finding that lexical resources and
cooccurrence information provide somewhat orthogonal information), but it
is certain that more careful combination strategies can be used.

Although distributional word vector dimensions cannot, in general, be
identified with linguistic properties, it has been shown that some vector
construction strategies yield dimensions that are relatively more
interpretable (Murphy et al., 2012; Fyshe et al., 2014; Fyshe et al., 2015;
Faruqui et al., 2015b). However, such analysis is difficult to generalize
across models of representation. In contrast to distributional word
vectors, linguistic word vectors have interpretable dimensions, as every
dimension is a linguistic property.

⁵ https://code.google.com/p/word2vec
⁶ http://www-nlp.stanford.edu/projects/glove/

Linguistic word vectors require no training as there are no parameters to
be optimized, meaning they are computationally economical. While
good-quality linguistic word vectors may only be obtained for languages
with rich linguistic resources, such resources do exist in many languages
and should not be disregarded.

5 Conclusion 

We have presented a novel method of constructing word vector
representations solely using linguistic knowledge from pre-existing
linguistic resources. These non-distributional, linguistic word vectors are
competitive with current models of distributional word vectors as evaluated
on a battery of tasks. Linguistic vectors are fully interpretable, as every
dimension is a linguistic feature, and are highly sparse, so they are
computationally easy to work with.